When you start a project it is good practice to use here::i_am("<TypeProjectName>") to set up the top-level project directory. The base-R alternative is setwd(), but this is fragile: it depends on the operating system and on the filesystem path (absolute vs. relative).
More info: https://here.r-lib.org
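A minimal sketch of the pattern (the script path week3/analysis.R is hypothetical; here detects the project root from the file declared in i_am() or from an .Rproj file):

```r
# Hypothetical setup sketch, assuming the here package is installed
library(here)

here::i_am("week3/analysis.R")   # declare where this script lives in the project
path <- here("week3", "data", "meteo_SF_FTPC1_2011-2020.rds")
# here() joins the pieces from the project root with the correct separator
# for the current OS, so the same code runs on Windows, macOS, and Linux
```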
We will use a meteorological data set from the National Oceanic and Atmospheric Administration (NOAA). NOAA is an agency under the U.S. Department of Commerce that provides a wealth of climate-related data and information. Go take a look at: https://www.noaa.gov.
Given we are in the San Francisco Bay Area, we might want to work on some local meteorological data. I’ve selected the Station FTPC1 - 9414290 - San Francisco, CA (37.806 N 122.466 W) and prepared a time series containing measurements of:

- Wind Direction (“WDIR”)
- Wind Speed (“WSPD”)
- Atmospheric Pressure (“PRES”)
- Atmospheric Temperature (“ATMP”)
The time-series is in the week3/data directory and the file name is meteo_SF_FTPC1_2011-2020.rds. An rds file is a native data format in R: a serialized object that, compared to a csv, occupies less space on disk.
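read_rds() is readr's wrapper around base R's readRDS(). A quick roundtrip sketch with the base functions shows why rds is convenient: the object comes back exactly as saved, column classes included, and the file is compressed by default.

```r
# Roundtrip sketch using base R (readr's write_rds()/read_rds() are wrappers)
df <- data.frame(TIME = as.POSIXct("2011-01-01", tz = "UTC") + 0:2,
                 ATMP = c(8.2, 8.2, 8.1))
path <- tempfile(fileext = ".rds")
saveRDS(df, path)       # serialize (gzip-compressed by default)
df2 <- readRDS(path)    # deserialize: column classes survive, unlike with csv
identical(df, df2)      # TRUE
```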
Import the file into memory and assign it to the variable meteo_df:
#meteo_df <- read_rds(file = "week3/data/meteo_SF_FTPC1_2011-2020.rds")
meteo_df <- read_rds(file = here("week3", "data", "meteo_SF_FTPC1_2011-2020.rds"))
Note that we used the function read_rds with the file name as the function argument, and the file name itself is the output of the function here(). What are the arguments in this case?
Once the data are imported we can check some of the data-set characteristics:
Use the dim() command to get information on meteo_df dimensions:
dim(meteo_df)
## [1] 870504      5
The output of the command gives you the number of rows and columns. Remember that you can execute ?dim in the Console to open the Help panel.
To check the first (default is 6) rows of meteo_df we might use head():
head(meteo_df)
From the output we can get some useful information:
From the columns’ names we can guess which environmental variable each corresponds to (e.g. WDIR could be Wind Direction). But we cannot get from the output:
To see when the data-set ends we can use the tail() command:
tail(meteo_df)
The command works similarly to head() but shows the last rows of the data.
A useful command to look into the data structure:
str(meteo_df)
## tibble [870,504 × 5] (S3: tbl_df/tbl/data.frame)
## $ TIME: chr [1:870504] "2011-01-01 00:00:00" "2011-01-01 00:06:00" "2011-01-01 00:12:00" "2011-01-01 00:18:00" ...
## $ WDIR: num [1:870504] 3 3 1 8 8 12 11 19 8 13 ...
## $ WSPD: num [1:870504] 4.3 4.3 4.6 4.4 4.4 3.9 3.2 3.4 3.1 3.5 ...
## $ PRES: num [1:870504] 1018 1018 1018 1018 1018 ...
## $ ATMP: num [1:870504] 8.2 8.2 8.2 8.1 8.1 8.1 8.2 8.3 8.2 8 ...
Another useful command to inspect the data-set:
glimpse(meteo_df) # this is a nice command that we will use a lot
## Rows: 870,504
## Columns: 5
## $ TIME <chr> "2011-01-01 00:00:00", "2011-01-01 00:06:00", "2011-01-01 00:12:0…
## $ WDIR <dbl> 3, 3, 1, 8, 8, 12, 11, 19, 8, 13, 7, 358, 358, 14, 7, 8, 12, 12, …
## $ WSPD <dbl> 4.3, 4.3, 4.6, 4.4, 4.4, 3.9, 3.2, 3.4, 3.1, 3.5, 3.9, 4.1, 4.1, …
## $ PRES <dbl> 1018.4, 1018.4, 1018.3, 1018.2, 1018.2, 1018.1, 1017.9, 1017.9, 1…
## $ ATMP <dbl> 8.2, 8.2, 8.2, 8.1, 8.1, 8.1, 8.2, 8.3, 8.2, 8.0, 8.0, 7.9, 7.9, …
To produce a table with a summary of each variable we can use summary(). The function invokes particular methods that depend on the class of the first argument.
summary(meteo_df)
## TIME WDIR WSPD PRES
## Length:870504 Min. : 0.0 Min. :0.000 Min. : 987.1
## Class :character 1st Qu.:197.0 1st Qu.:1.800 1st Qu.:1013.8
## Mode :character Median :238.0 Median :3.100 Median :1016.7
## Mean :239.7 Mean :3.457 Mean :1284.8
## 3rd Qu.:259.0 3rd Qu.:4.700 3rd Qu.:1020.5
## Max. :999.0 Max. :9.900 Max. :9999.0
## ATMP
## Min. : 1.8
## 1st Qu.: 11.4
## Median : 13.0
## Mean : 52.6
## 3rd Qu.: 14.9
## Max. :999.0
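summary() being generic explains the output above: the character TIME column only reports Length/Class/Mode, while the numeric columns get quantiles and a mean. A tiny sketch, with a hint of the problem lurking in those Max. values:

```r
x <- c(1, 5, 9, 999)               # numeric: quantiles and mean...
summary(x)                         # ...one 999 sentinel drags the mean to 253.5
summary(letters[1:4])              # character: only Length / Class / Mode
summary(factor(c("a", "a", "b")))  # factor: counts per level instead
```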
Data without metadata are almost useless… Metadata is data that provides information about other data, but not the content of that data. Take a look at the wiki: https://en.wikipedia.org/wiki/Metadata.
Looking at the summary() output we can note that:
We need to know more about:
To get this information we need to go back to the NOAA Station webpage and check the data descriptions: https://www.ndbc.noaa.gov/measdes.shtml
From the link we can see that:
### NA or missing values

In the following part of the class we will only use TIME and TEMPERATURE. We don’t need to carry around the other variables (for now…). Use the tidyverse syntax to subset meteo_df and get a new data frame. Assign the new data frame to the variable temperature_df.
Reminder: we will use the %>% (known as the pipe) operator and the function select(). You can read the following code as: from meteo_df select the variables TIME and ATMP and assign them to a new object called temperature_df:
temperature_df <- meteo_df %>%
select(TIME, ATMP)
head(temperature_df)
Coordinated Universal Time or UTC is the primary time standard by which the world regulates clocks and time. Time zones are defined as offsets from the reference (prime) meridian at Greenwich.
https://en.wikipedia.org/wiki/Coordinated_Universal_Time#/media/File:World_Time_Zones_Map.png
Calculating time for navigation..
In 1714, the British government offered a longitude prize for a method of determining longitude at sea, with awards of up to £20,000 (roughly $2.6 to $5.2 million in 2022 terms) depending on accuracy.
From 1730 to 1761 John Harrison worked on the creation of the first chronometer to measure the time differences between anywhere on the Earth and the Greenwich Observatory.
He solved the greatest problem of his time.. indeed.
### lubridate to work with Time in R

So.. we should be ready to do some analysis. But first we need to tell R to consider TIME, still a vector of character strings, as a date/time object.
Date/time is very annoying for a computer.. how much is \(365 + 1\)? It is not trivial.. is it 366? Is it 1?.. actually it can be both if we are counting days of the year: leap years occur every 4 years (the Julian year is 365.25 days).
Does every day have 24 hr?
Dates and times are hard because they have to reconcile two physical phenomena (the rotation of the Earth and its orbit around the Sun) with a whole raft of geopolitical phenomena including months, time zones, and Daylight Saving Time (DST).
The lubridate package makes it easier to work with dates and times in R. lubridate is not part of the core tidyverse because you only need it when you’re working with dates/times. If we already imported the library, good; otherwise type library(lubridate) in the Console to load the package’s functions into memory.
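With lubridate loaded we can already answer the two questions above; a sketch (the 23-hour day is the 2020 DST spring-forward in Los Angeles):

```r
library(lubridate)

leap_year(2020)                         # TRUE: 2020 has 366 days
days_in_month(ymd("2020-02-01"))        # Feb: 29 in a leap year

# Not every day has 24 hours: DST spring-forward, Los Angeles, 2020-03-08
start <- ymd_hms("2020-03-08 00:00:00", tz = "America/Los_Angeles")
end   <- ymd_hms("2020-03-09 00:00:00", tz = "America/Los_Angeles")
difftime(end, start, units = "hours")   # 23 hours
```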
With lubridate we can add to R two new classes:

- date, printed as <date>
- date-time, a more complicated object that uniquely defines an instant in time, printed as <dttm>

As you may have guessed.. we will use <dttm> objects!
We can use the ymd() family to convert a string to a date object:
dumb_data <- "2022-04-22"
typeof(dumb_data)
## [1] "character"
dumb_data <- lubridate::ymd(dumb_data)
print(dumb_data)
## [1] "2022-04-22"
typeof(dumb_data) # What do we expect as output?
## [1] "double"
class(dumb_data) # It is a new class! Not a new type!
## [1] "Date"
Why is it a double? We just said that it was a date… R stores date/times as numeric offsets from the “Unix Epoch”, 1970-01-01… with offsets in seconds (for <dttm>) or days (for date).
as.numeric(ymd("2022-04-22"))
## [1] 19104
ymd("2022-04-22") - 19104
## [1] "1970-01-01"
as_date(19104)
## [1] "2022-04-22"
as_datetime(19104)
## [1] "1970-01-01 05:18:24 UTC"
What is the offset of now()? Try as.numeric(now()):
as.numeric(ymd_hms("2022-04-22 15:00:00")) #as.numeric(now())
## [1] 1650639600
print(2 ^ .Machine$double.digits)
## [1] 9.007199e+15
as_datetime(2 ^ .Machine$double.digits) # "the end of time" according to R (.Machine$double.digits is 53, the bits of a double's significand)
## [1] "285428751-11-12 07:36:32 UTC"
as_datetime(-1 * (2 ^ .Machine$double.digits)) # the genesis (according to R)
## [1] "-285424812-02-20 16:23:28 UTC"
Check https://en.wikipedia.org/wiki/Future_of_Earth.. at least for a few years we should be good with this system. Later on some more critical problems may arise..
Imagine that someone saved a date in a very uncommon way…
mdy("January 31st, 2017")
## [1] "2017-01-31"
dym("31, 2017, 1")
## [1] "2017-01-31"
dym("31, 2017, Jan")
## [1] "2017-01-31"
dym("31-2017;:-#ùì1")
## [1] "2017-01-31"
make_date(year = 2017, month = 01, day = 31)
## [1] "2017-01-31"
make_datetime(2017,01,31,12,00,00)
## [1] "2017-01-31 12:00:00 UTC"
Sometimes it is necessary to get date/time components:

- year()
- month()
- yday()
- mday()
- wday()
- hour()
- minute()
- second()

dumb_datetime <- ymd_hms("2022-04-22 15:34:56")
yday(dumb_datetime)
## [1] 112
wday(dumb_datetime)
## [1] 6
wday(dumb_datetime, label = TRUE)
## [1] Fri
## Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat
wday(dumb_datetime, label = TRUE, abbr = FALSE)
## [1] Friday
## 7 Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < ... < Saturday
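wday() also takes a week_start argument, handy when you prefer weeks starting on Monday rather than the default Sunday:

```r
library(lubridate)

d <- ymd("2022-04-22")        # a Friday
wday(d)                       # 6: by default weeks start on Sunday
wday(d, week_start = 1)       # 5: with a Monday start, Friday is day 5
```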
We can use the following functions to “round” the date:

- floor_date()
- round_date()
- ceiling_date()

dumb_datetime <- ymd_hms("2022-04-22 15:34:56")
floor_date(dumb_datetime, unit = "year")
## [1] "2022-01-01 UTC"
floor_date(dumb_datetime, unit = "week")
## [1] "2022-04-17 UTC"
wday(floor_date(dumb_datetime, unit = "week"))
## [1] 1
wday(floor_date(dumb_datetime, unit = "week"), label = TRUE)
## [1] Sun
## Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat
rm(dumb_datetime)
Change part of a date-time object and update():
(datetime <- ymd_hms("2016-07-08 12:34:56"))
## [1] "2016-07-08 12:34:56 UTC"
year(datetime) <- 2020
datetime
## [1] "2020-07-08 12:34:56 UTC"
month(datetime) <- 01
datetime
## [1] "2020-01-08 12:34:56 UTC"
hour(datetime) <- hour(datetime) + 1
datetime
## [1] "2020-01-08 13:34:56 UTC"
update(datetime, year = 2020, month = 2, mday = 2, hour = 2)
## [1] "2020-02-02 02:34:56 UTC"
How old is Fede?
today() - ymd(19890117)
## Time difference of 12148 days
as.duration(today() - ymd(19890117))
## [1] "1049587200s (~33.26 years)"
We can use different durations (calculated in seconds.. their behavior can be confusing across DST):

- dseconds()
- dminutes()
- dhours()
- ddays()
- dweeks()
- dyears()

We can also use periods: time spans that don’t have a fixed length in seconds; instead they work with “human” times, like days and months.
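The difference between durations and periods matters exactly on DST nights. A sketch on the 2020 spring-forward in Los Angeles:

```r
library(lubridate)

x <- ymd_hms("2020-03-07 12:00:00", tz = "America/Los_Angeles")
x + ddays(1)   # duration: exactly 86400 s later -> 13:00 local, clock shifted
x + days(1)    # period: one "human" day later   -> 12:00 local, same wall time
```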
Now that we have introduced lubridate we are ready to work on the data-set! Remember that we have to deal with temperature NAs!
### ymd_hms()

We know that TIME is in UTC; let’s use lubridate to change the class of the TIME character string to <dttm>:
temperature_df <- temperature_df %>%
mutate(TIME = lubridate::ymd_hms(TIME, tz = "UTC"))
### na_if()

We also know that missing ATMP values are coded as 999. Use the function na_if() to convert them to NA so they can be handled by R:
temperature_df <- temperature_df %>%
mutate(ATMP = na_if(x = ATMP, y = 999))
summary(temperature_df)
## TIME ATMP
## Min. :2011-01-01 00:00:00 Min. : 1.80
## 1st Qu.:2013-06-30 11:04:30 1st Qu.:11.30
## Median :2015-12-29 13:57:00 Median :12.90
## Mean :2015-12-31 00:24:33 Mean :12.99
## 3rd Qu.:2018-06-30 22:01:30 3rd Qu.:14.60
## Max. :2020-12-31 23:54:00 Max. :34.40
## NA's :34964
Now that TIME has the correct class and the summary table for ATMP is sound, it is a good idea to start plotting some parts of the data-set.
With such a huge dataset some sanity checks are mandatory before starting a real analysis…
We can plot the temperature of a single day.. plotting the entire data frame would not tell us much… for instance, 1st January 2020:
temperature_df %>%
filter(TIME >= "2020-01-01",
TIME < "2020-01-02") %>%
ggplot(data = ., aes(x = TIME, y = ATMP)) +
geom_line() +
theme_classic()
We note that:
### with_tz(): considering local time

The data are in UTC! So we want to change the time zone to the Pacific Time Zone (PT). We will use the with_tz() function:
temperature_df <- temperature_df %>%
mutate(TIME_LOCAL = lubridate::with_tz(TIME, tzone = "America/Los_Angeles")) %>%
glimpse()
## Rows: 870,504
## Columns: 3
## $ TIME <dttm> 2011-01-01 00:00:00, 2011-01-01 00:06:00, 2011-01-01 00:12…
## $ ATMP <dbl> 8.2, 8.2, 8.2, 8.1, 8.1, 8.1, 8.2, 8.3, 8.2, 8.0, 8.0, 7.9,…
## $ TIME_LOCAL <dttm> 2010-12-31 16:00:00, 2010-12-31 16:06:00, 2010-12-31 16:12…
Here we created a new variable TIME_LOCAL while keeping the original UTC TIME.
To check the list of time-zone names available on your system, run grep("America", OlsonNames(), value = TRUE) in the R Console.
Check whether TIME_LOCAL and ATMP now reflect local time:
temperature_df %>%
select(TIME_LOCAL, ATMP) %>%
filter(TIME_LOCAL >= "2020-01-01",
TIME_LOCAL < "2020-01-02") %>%
ggplot(data = ., aes(x = TIME_LOCAL, y = ATMP)) +
geom_line() +
theme_classic()
Oh! We are very happy now!
### duplicated() rows check

You may not expect to have to check this.. but local time is affected by DST!
Why are we checking with as.character()? Guess..
There are duplicated values in the same month… every year… at the same hour! Indeed.. with local DST there is one missing hour and one duplicated hour every year.
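A minimal illustration of why as.character() matters here: duplicated() on a POSIXct vector compares instants (all distinct), while after as.character() it compares wall-clock labels, which repeat when the clock falls back:

```r
library(lubridate)

# two distinct instants, one hour apart, spanning the 2020 fall-back in LA
t <- ymd_hms(c("2020-11-01 08:30:00", "2020-11-01 09:30:00"), tz = "UTC")
local <- with_tz(t, tzone = "America/Los_Angeles")
duplicated(local)                # FALSE FALSE: the instants differ
duplicated(as.character(local))  # FALSE  TRUE: both print "2020-11-01 01:30:00"
```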
temperature_df$TIME[duplicated(as.character(temperature_df$TIME))]
## POSIXct of length 0
temperature_df$TIME_LOCAL[duplicated((temperature_df$TIME_LOCAL))]
## POSIXct of length 0
temperature_df$TIME_LOCAL[duplicated(as.character(temperature_df$TIME_LOCAL))] %>% knitr::kable() #use view()
| x |
|---|
| 2011-11-06 01:00:00 |
| 2011-11-06 01:06:00 |
| 2011-11-06 01:12:00 |
| 2011-11-06 01:18:00 |
| 2011-11-06 01:24:00 |
| 2011-11-06 01:30:00 |
| 2011-11-06 01:36:00 |
| 2011-11-06 01:42:00 |
| 2011-11-06 01:48:00 |
| 2011-11-06 01:54:00 |
| 2012-11-04 01:00:00 |
| 2012-11-04 01:06:00 |
| 2012-11-04 01:12:00 |
| 2012-11-04 01:18:00 |
| 2012-11-04 01:24:00 |
| 2012-11-04 01:30:00 |
| 2012-11-04 01:36:00 |
| 2012-11-04 01:42:00 |
| 2012-11-04 01:48:00 |
| 2012-11-04 01:54:00 |
| 2013-11-03 01:00:00 |
| 2013-11-03 01:06:00 |
| 2013-11-03 01:12:00 |
| 2013-11-03 01:18:00 |
| 2013-11-03 01:24:00 |
| 2013-11-03 01:30:00 |
| 2013-11-03 01:36:00 |
| 2013-11-03 01:42:00 |
| 2013-11-03 01:48:00 |
| 2013-11-03 01:54:00 |
| 2014-11-02 01:00:00 |
| 2014-11-02 01:06:00 |
| 2014-11-02 01:12:00 |
| 2014-11-02 01:18:00 |
| 2014-11-02 01:24:00 |
| 2014-11-02 01:30:00 |
| 2014-11-02 01:36:00 |
| 2014-11-02 01:42:00 |
| 2014-11-02 01:48:00 |
| 2014-11-02 01:54:00 |
| 2015-11-01 01:00:00 |
| 2015-11-01 01:06:00 |
| 2015-11-01 01:12:00 |
| 2015-11-01 01:18:00 |
| 2015-11-01 01:24:00 |
| 2015-11-01 01:30:00 |
| 2015-11-01 01:36:00 |
| 2015-11-01 01:42:00 |
| 2015-11-01 01:48:00 |
| 2015-11-01 01:54:00 |
| 2016-11-06 01:00:00 |
| 2016-11-06 01:06:00 |
| 2016-11-06 01:12:00 |
| 2016-11-06 01:18:00 |
| 2016-11-06 01:24:00 |
| 2016-11-06 01:30:00 |
| 2016-11-06 01:36:00 |
| 2016-11-06 01:42:00 |
| 2016-11-06 01:48:00 |
| 2016-11-06 01:54:00 |
| 2017-11-05 01:00:00 |
| 2017-11-05 01:06:00 |
| 2017-11-05 01:12:00 |
| 2017-11-05 01:18:00 |
| 2017-11-05 01:24:00 |
| 2017-11-05 01:30:00 |
| 2017-11-05 01:36:00 |
| 2017-11-05 01:42:00 |
| 2017-11-05 01:48:00 |
| 2017-11-05 01:54:00 |
| 2018-11-04 01:00:00 |
| 2018-11-04 01:06:00 |
| 2018-11-04 01:12:00 |
| 2018-11-04 01:18:00 |
| 2018-11-04 01:24:00 |
| 2018-11-04 01:30:00 |
| 2018-11-04 01:36:00 |
| 2018-11-04 01:42:00 |
| 2018-11-04 01:48:00 |
| 2018-11-04 01:54:00 |
| 2019-11-03 01:00:00 |
| 2019-11-03 01:06:00 |
| 2019-11-03 01:12:00 |
| 2019-11-03 01:18:00 |
| 2019-11-03 01:24:00 |
| 2019-11-03 01:30:00 |
| 2019-11-03 01:36:00 |
| 2019-11-03 01:42:00 |
| 2019-11-03 01:48:00 |
| 2019-11-03 01:54:00 |
| 2020-11-01 01:00:00 |
| 2020-11-01 01:06:00 |
| 2020-11-01 01:12:00 |
| 2020-11-01 01:18:00 |
| 2020-11-01 01:24:00 |
| 2020-11-01 01:30:00 |
| 2020-11-01 01:36:00 |
| 2020-11-01 01:42:00 |
| 2020-11-01 01:48:00 |
| 2020-11-01 01:54:00 |
temperature_df %>%
filter(TIME_LOCAL >= "2020-03-08 01:30:00",
TIME_LOCAL < "2020-03-08 04:00:00") %>%
#view()
knitr::kable()
| TIME | ATMP | TIME_LOCAL |
|---|---|---|
| 2020-03-08 09:30:00 | 7.9 | 2020-03-08 01:30:00 |
| 2020-03-08 09:36:00 | 7.8 | 2020-03-08 01:36:00 |
| 2020-03-08 09:42:00 | 7.8 | 2020-03-08 01:42:00 |
| 2020-03-08 09:48:00 | NA | 2020-03-08 01:48:00 |
| 2020-03-08 09:54:00 | 7.5 | 2020-03-08 01:54:00 |
| 2020-03-08 10:00:00 | 7.6 | 2020-03-08 03:00:00 |
| 2020-03-08 10:06:00 | 7.3 | 2020-03-08 03:06:00 |
| 2020-03-08 10:12:00 | 7.4 | 2020-03-08 03:12:00 |
| 2020-03-08 10:18:00 | NA | 2020-03-08 03:18:00 |
| 2020-03-08 10:24:00 | 7.3 | 2020-03-08 03:24:00 |
| 2020-03-08 10:30:00 | NA | 2020-03-08 03:30:00 |
| 2020-03-08 10:36:00 | 7.1 | 2020-03-08 03:36:00 |
| 2020-03-08 10:42:00 | 7.3 | 2020-03-08 03:42:00 |
| 2020-03-08 10:48:00 | 7.4 | 2020-03-08 03:48:00 |
| 2020-03-08 10:54:00 | 7.3 | 2020-03-08 03:54:00 |
You can see that there is no 2AM!
temperature_df %>%
filter(TIME_LOCAL >= "2020-11-01 00:30:00",
TIME_LOCAL < "2020-11-01 04:00:00") %>%
#view()
knitr::kable()
| TIME | ATMP | TIME_LOCAL |
|---|---|---|
| 2020-11-01 07:30:00 | 12.4 | 2020-11-01 00:30:00 |
| 2020-11-01 07:36:00 | 12.3 | 2020-11-01 00:36:00 |
| 2020-11-01 07:42:00 | 12.5 | 2020-11-01 00:42:00 |
| 2020-11-01 07:48:00 | 12.6 | 2020-11-01 00:48:00 |
| 2020-11-01 07:54:00 | 12.8 | 2020-11-01 00:54:00 |
| 2020-11-01 08:00:00 | 12.9 | 2020-11-01 01:00:00 |
| 2020-11-01 08:06:00 | 12.8 | 2020-11-01 01:06:00 |
| 2020-11-01 08:12:00 | 12.8 | 2020-11-01 01:12:00 |
| 2020-11-01 08:18:00 | 13.0 | 2020-11-01 01:18:00 |
| 2020-11-01 08:24:00 | 13.2 | 2020-11-01 01:24:00 |
| 2020-11-01 08:30:00 | 13.3 | 2020-11-01 01:30:00 |
| 2020-11-01 08:36:00 | 13.2 | 2020-11-01 01:36:00 |
| 2020-11-01 08:42:00 | 13.3 | 2020-11-01 01:42:00 |
| 2020-11-01 08:48:00 | 13.4 | 2020-11-01 01:48:00 |
| 2020-11-01 08:54:00 | 13.1 | 2020-11-01 01:54:00 |
| 2020-11-01 09:00:00 | 12.8 | 2020-11-01 01:00:00 |
| 2020-11-01 09:06:00 | 12.9 | 2020-11-01 01:06:00 |
| 2020-11-01 09:12:00 | 13.4 | 2020-11-01 01:12:00 |
| 2020-11-01 09:18:00 | NA | 2020-11-01 01:18:00 |
| 2020-11-01 09:24:00 | 13.7 | 2020-11-01 01:24:00 |
| 2020-11-01 09:30:00 | 13.5 | 2020-11-01 01:30:00 |
| 2020-11-01 09:36:00 | 13.6 | 2020-11-01 01:36:00 |
| 2020-11-01 09:42:00 | 14.0 | 2020-11-01 01:42:00 |
| 2020-11-01 09:48:00 | 13.5 | 2020-11-01 01:48:00 |
| 2020-11-01 09:54:00 | 13.4 | 2020-11-01 01:54:00 |
| 2020-11-01 10:00:00 | 13.9 | 2020-11-01 02:00:00 |
| 2020-11-01 10:06:00 | 14.1 | 2020-11-01 02:06:00 |
| 2020-11-01 10:12:00 | 14.0 | 2020-11-01 02:12:00 |
| 2020-11-01 10:18:00 | 13.6 | 2020-11-01 02:18:00 |
| 2020-11-01 10:24:00 | 13.8 | 2020-11-01 02:24:00 |
| 2020-11-01 10:30:00 | 13.7 | 2020-11-01 02:30:00 |
| 2020-11-01 10:36:00 | 13.9 | 2020-11-01 02:36:00 |
| 2020-11-01 10:42:00 | 13.6 | 2020-11-01 02:42:00 |
| 2020-11-01 10:48:00 | 13.3 | 2020-11-01 02:48:00 |
| 2020-11-01 10:54:00 | 13.3 | 2020-11-01 02:54:00 |
| 2020-11-01 11:00:00 | NA | 2020-11-01 03:00:00 |
There are two 1AMs in November.. a mess..
Depending on the analysis you’re carrying out, you might want a different time representation..
We might want to use the Solar Local Time. This makes more sense when doing some climatic analysis.. not when communicating with the public!
If we check again the time-zone plot: https://en.wikipedia.org/wiki/Coordinated_Universal_Time#/media/File:World_Time_Zones_Map.png
we see that San Francisco belongs to UTC-8. Use:
temperature_df <- temperature_df %>%
mutate(TIME_LOCAL_SOLAR = TIME - lubridate::hours(8))
In this case we make an explicit call to the function hours() from lubridate using ::. This is a good way to call a function while avoiding implicit, load-order-dependent name lookup.
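A classic example of why this matters: filter() exists in both stats and dplyr, and which one plain filter means depends on package load order. Qualifying with :: removes the ambiguity:

```r
# stats::filter() does time-series filtering (here a trailing moving sum)...
stats::filter(1:5, rep(1, 2), sides = 1)
# ...while dplyr::filter() selects rows, e.g. (assuming dplyr is loaded):
# dplyr::filter(temperature_df, ATMP < 30)
```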
temperature_df %>%
filter(TIME_LOCAL >= "2020-03-08 01:30:00",
TIME_LOCAL < "2020-03-08 04:00:00") %>%
#view()
knitr::kable()
| TIME | ATMP | TIME_LOCAL | TIME_LOCAL_SOLAR |
|---|---|---|---|
| 2020-03-08 09:30:00 | 7.9 | 2020-03-08 01:30:00 | 2020-03-08 01:30:00 |
| 2020-03-08 09:36:00 | 7.8 | 2020-03-08 01:36:00 | 2020-03-08 01:36:00 |
| 2020-03-08 09:42:00 | 7.8 | 2020-03-08 01:42:00 | 2020-03-08 01:42:00 |
| 2020-03-08 09:48:00 | NA | 2020-03-08 01:48:00 | 2020-03-08 01:48:00 |
| 2020-03-08 09:54:00 | 7.5 | 2020-03-08 01:54:00 | 2020-03-08 01:54:00 |
| 2020-03-08 10:00:00 | 7.6 | 2020-03-08 03:00:00 | 2020-03-08 02:00:00 |
| 2020-03-08 10:06:00 | 7.3 | 2020-03-08 03:06:00 | 2020-03-08 02:06:00 |
| 2020-03-08 10:12:00 | 7.4 | 2020-03-08 03:12:00 | 2020-03-08 02:12:00 |
| 2020-03-08 10:18:00 | NA | 2020-03-08 03:18:00 | 2020-03-08 02:18:00 |
| 2020-03-08 10:24:00 | 7.3 | 2020-03-08 03:24:00 | 2020-03-08 02:24:00 |
| 2020-03-08 10:30:00 | NA | 2020-03-08 03:30:00 | 2020-03-08 02:30:00 |
| 2020-03-08 10:36:00 | 7.1 | 2020-03-08 03:36:00 | 2020-03-08 02:36:00 |
| 2020-03-08 10:42:00 | 7.3 | 2020-03-08 03:42:00 | 2020-03-08 02:42:00 |
| 2020-03-08 10:48:00 | 7.4 | 2020-03-08 03:48:00 | 2020-03-08 02:48:00 |
| 2020-03-08 10:54:00 | 7.3 | 2020-03-08 03:54:00 | 2020-03-08 02:54:00 |
temperature_df %>%
filter(TIME_LOCAL >= "2020-11-01 00:00:00",
TIME_LOCAL < "2020-11-01 04:00:00") %>%
#view()
knitr::kable()
| TIME | ATMP | TIME_LOCAL | TIME_LOCAL_SOLAR |
|---|---|---|---|
| 2020-11-01 07:00:00 | 12.5 | 2020-11-01 00:00:00 | 2020-10-31 23:00:00 |
| 2020-11-01 07:06:00 | 12.5 | 2020-11-01 00:06:00 | 2020-10-31 23:06:00 |
| 2020-11-01 07:12:00 | 12.4 | 2020-11-01 00:12:00 | 2020-10-31 23:12:00 |
| 2020-11-01 07:18:00 | NA | 2020-11-01 00:18:00 | 2020-10-31 23:18:00 |
| 2020-11-01 07:24:00 | 12.4 | 2020-11-01 00:24:00 | 2020-10-31 23:24:00 |
| 2020-11-01 07:30:00 | 12.4 | 2020-11-01 00:30:00 | 2020-10-31 23:30:00 |
| 2020-11-01 07:36:00 | 12.3 | 2020-11-01 00:36:00 | 2020-10-31 23:36:00 |
| 2020-11-01 07:42:00 | 12.5 | 2020-11-01 00:42:00 | 2020-10-31 23:42:00 |
| 2020-11-01 07:48:00 | 12.6 | 2020-11-01 00:48:00 | 2020-10-31 23:48:00 |
| 2020-11-01 07:54:00 | 12.8 | 2020-11-01 00:54:00 | 2020-10-31 23:54:00 |
| 2020-11-01 08:00:00 | 12.9 | 2020-11-01 01:00:00 | 2020-11-01 00:00:00 |
| 2020-11-01 08:06:00 | 12.8 | 2020-11-01 01:06:00 | 2020-11-01 00:06:00 |
| 2020-11-01 08:12:00 | 12.8 | 2020-11-01 01:12:00 | 2020-11-01 00:12:00 |
| 2020-11-01 08:18:00 | 13.0 | 2020-11-01 01:18:00 | 2020-11-01 00:18:00 |
| 2020-11-01 08:24:00 | 13.2 | 2020-11-01 01:24:00 | 2020-11-01 00:24:00 |
| 2020-11-01 08:30:00 | 13.3 | 2020-11-01 01:30:00 | 2020-11-01 00:30:00 |
| 2020-11-01 08:36:00 | 13.2 | 2020-11-01 01:36:00 | 2020-11-01 00:36:00 |
| 2020-11-01 08:42:00 | 13.3 | 2020-11-01 01:42:00 | 2020-11-01 00:42:00 |
| 2020-11-01 08:48:00 | 13.4 | 2020-11-01 01:48:00 | 2020-11-01 00:48:00 |
| 2020-11-01 08:54:00 | 13.1 | 2020-11-01 01:54:00 | 2020-11-01 00:54:00 |
| 2020-11-01 09:00:00 | 12.8 | 2020-11-01 01:00:00 | 2020-11-01 01:00:00 |
| 2020-11-01 09:06:00 | 12.9 | 2020-11-01 01:06:00 | 2020-11-01 01:06:00 |
| 2020-11-01 09:12:00 | 13.4 | 2020-11-01 01:12:00 | 2020-11-01 01:12:00 |
| 2020-11-01 09:18:00 | NA | 2020-11-01 01:18:00 | 2020-11-01 01:18:00 |
| 2020-11-01 09:24:00 | 13.7 | 2020-11-01 01:24:00 | 2020-11-01 01:24:00 |
| 2020-11-01 09:30:00 | 13.5 | 2020-11-01 01:30:00 | 2020-11-01 01:30:00 |
| 2020-11-01 09:36:00 | 13.6 | 2020-11-01 01:36:00 | 2020-11-01 01:36:00 |
| 2020-11-01 09:42:00 | 14.0 | 2020-11-01 01:42:00 | 2020-11-01 01:42:00 |
| 2020-11-01 09:48:00 | 13.5 | 2020-11-01 01:48:00 | 2020-11-01 01:48:00 |
| 2020-11-01 09:54:00 | 13.4 | 2020-11-01 01:54:00 | 2020-11-01 01:54:00 |
| 2020-11-01 10:00:00 | 13.9 | 2020-11-01 02:00:00 | 2020-11-01 02:00:00 |
| 2020-11-01 10:06:00 | 14.1 | 2020-11-01 02:06:00 | 2020-11-01 02:06:00 |
| 2020-11-01 10:12:00 | 14.0 | 2020-11-01 02:12:00 | 2020-11-01 02:12:00 |
| 2020-11-01 10:18:00 | 13.6 | 2020-11-01 02:18:00 | 2020-11-01 02:18:00 |
| 2020-11-01 10:24:00 | 13.8 | 2020-11-01 02:24:00 | 2020-11-01 02:24:00 |
| 2020-11-01 10:30:00 | 13.7 | 2020-11-01 02:30:00 | 2020-11-01 02:30:00 |
| 2020-11-01 10:36:00 | 13.9 | 2020-11-01 02:36:00 | 2020-11-01 02:36:00 |
| 2020-11-01 10:42:00 | 13.6 | 2020-11-01 02:42:00 | 2020-11-01 02:42:00 |
| 2020-11-01 10:48:00 | 13.3 | 2020-11-01 02:48:00 | 2020-11-01 02:48:00 |
| 2020-11-01 10:54:00 | 13.3 | 2020-11-01 02:54:00 | 2020-11-01 02:54:00 |
| 2020-11-01 11:00:00 | NA | 2020-11-01 03:00:00 | 2020-11-01 03:00:00 |
If we are doing a multi-year climatic analysis we are missing ~8 hr.. over 10 years this is not a huge issue.. but take it into consideration.
summary(temperature_df)
## TIME ATMP TIME_LOCAL
## Min. :2011-01-01 00:00:00 Min. : 1.80 Min. :2010-12-31 16:00:00
## 1st Qu.:2013-06-30 11:04:30 1st Qu.:11.30 1st Qu.:2013-06-30 04:04:30
## Median :2015-12-29 13:57:00 Median :12.90 Median :2015-12-29 05:57:00
## Mean :2015-12-31 00:24:33 Mean :12.99 Mean :2015-12-30 16:24:33
## 3rd Qu.:2018-06-30 22:01:30 3rd Qu.:14.60 3rd Qu.:2018-06-30 15:01:30
## Max. :2020-12-31 23:54:00 Max. :34.40 Max. :2020-12-31 15:54:00
## NA's :34964
## TIME_LOCAL_SOLAR
## Min. :2010-12-31 16:00:00
## 1st Qu.:2013-06-30 03:04:30
## Median :2015-12-29 05:57:00
## Mean :2015-12-30 16:24:33
## 3rd Qu.:2018-06-30 14:01:30
## Max. :2020-12-31 15:54:00
##
There are no missing or duplicated hours.. as we wanted.
temperature_df$TIME_LOCAL_SOLAR[duplicated(as.character(temperature_df$TIME_LOCAL_SOLAR))]
## POSIXct of length 0
### group_by() and summarise()

These functions are more powerful than you can even imagine..
- group_by() is used to group rows that contain the same value
- summarise(), together with the summary functions inside it, takes a vector of values and returns a single value

### YEAR

Let’s prepare our final data-set and save it in temperature_df_solar.
temperature_df_solar <- temperature_df %>%
select(TIME_LOCAL_SOLAR, ATMP)
We will use a working data-set tmp_df and we will create a YEAR variable that we are going to use for grouping and doing some statistical analysis:
tmp_df <- temperature_df_solar %>%
mutate(YEAR = lubridate::year(TIME_LOCAL_SOLAR))
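Before running the real thing, here is the group_by()/summarise() pattern on a toy tibble (the numbers are made up for illustration):

```r
library(dplyr)

toy <- tibble(YEAR = c(2011, 2011, 2012), ATMP = c(10, 12, 8))
res <- toy %>%
  group_by(YEAR) %>%                 # one group per distinct YEAR
  summarise(AVG_TEMP = mean(ATMP),   # collapse each group to one row
            nr = n())                # n() counts rows per group
res
# a 2-row tibble: 2011 -> AVG_TEMP 11 (nr 2), 2012 -> AVG_TEMP 8 (nr 1)
```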
NOTE FOR FEDE: this part has to be explained slowly.. introduce group_by() and summarise(); start without filter() and with only the TEMP average and n(). Then filter() and add the standard deviation etc.
tmp_df %>%
# filter(TIME_LOCAL_SOLAR >= lubridate::ymd("2011-01-01")) %>%
group_by(YEAR) %>%
#summarise(TIME = mean(TIME_LOCAL_SOLAR, na.rm=TRUE), # we don't really need this..
# option 1, we remove it.. option 2 below
summarise(
AVG_TEMP = mean(ATMP, na.rm=TRUE),
#SD_TEMP = sd(ATMP, na.rm=TRUE),
#SD_TEMP_PERC = SD_TEMP/AVG_TEMP*100, # this is cool
#SD_TEMP_PERC2 = sd(ATMP, na.rm=TRUE)/mean(ATMP, na.rm=TRUE)*100,
nr = n()
) %>%
glimpse()
## Rows: 11
## Columns: 3
## $ YEAR <dbl> 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2…
## $ AVG_TEMP <dbl> 8.408861, 12.065400, 12.146867, 12.326561, 13.934820, 13.8483…
## $ nr <int> 79, 87239, 87382, 86889, 87265, 87056, 86751, 87212, 86505, 8…
### mean(ATMP) Part 1

The best way to look at the data is to make some plots. We will introduce:

- piping %>% into ggplot()
- geom_line() to plot lines
- geom_point() to plot points
- scale_x_continuous()

tmp_df %>%
filter(TIME_LOCAL_SOLAR >= lubridate::ymd("2011-01-01")) %>%
group_by(YEAR) %>%
summarise(
AVG_TEMP = mean(ATMP, na.rm=TRUE),
SD_TEMP = sd(ATMP, na.rm=TRUE),
SD_TEMP_PERC = SD_TEMP/AVG_TEMP*100,
nr = n()
) %>%
ggplot(data = ., aes(x = YEAR, y = AVG_TEMP)) +
geom_line(color="#006494") +
geom_point(size=3, color="#003554") +
# geom_point(size=3, color="red") + # red covering the blue
#geom_line() + #this cover the points! better before
scale_x_continuous(breaks = seq(2011,2020,1), limits = c(2011,2020)) +
xlab(element_blank()) + #https://ggplot2.tidyverse.org/reference/theme.html
ylab("Atmospheric Temperature (°C)") +
theme_classic() +
theme(panel.grid.major = element_line(colour = "#B3B0AE"))
Comment: ggplot’s position scales for date/time data are very handy for formatting the default appearance of the x-axis labels: https://ggplot2.tidyverse.org/reference/scale_date.html.
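A sketch with toy data (a made-up diurnal cycle, not station data), assuming ggplot2 and lubridate are installed; scale_x_datetime() accepts date_breaks as a plain-English interval and date_labels as strftime codes:

```r
library(ggplot2)
library(lubridate)

df <- data.frame(
  TIME = ymd_hms("2020-01-01 00:00:00") + hours(0:23),
  ATMP = 10 + 3 * sin((0:23) / 24 * 2 * pi)   # toy diurnal cycle
)
p <- ggplot(df, aes(TIME, ATMP)) +
  geom_line() +
  scale_x_datetime(date_breaks = "6 hours", date_labels = "%H:%M") +
  theme_classic()
p
```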
### mean(ATMP) Part 2

Adding the Standard Deviation to the plot.
tmp_df %>%
filter(TIME_LOCAL_SOLAR >= lubridate::ymd("2011-01-01")) %>%
group_by(YEAR) %>%
summarise(
AVG_TEMP = mean(ATMP, na.rm=TRUE),
SD_TEMP = sd(ATMP, na.rm=TRUE),
SD_TEMP_PERC = SD_TEMP/AVG_TEMP*100,
nr = n()
) %>%
ggplot(data = .) +
geom_line(aes(x = YEAR, y = (AVG_TEMP + SD_TEMP )), color = "#006494") + # THIS THE SD
geom_line(aes(x = YEAR, y = (AVG_TEMP - SD_TEMP )), color = "#006494") + # THIS THE SD
geom_line(aes(x = YEAR, y = AVG_TEMP), color="#006494") + # THIS FOR THE AVERAGE
geom_point(aes(x = YEAR, y = AVG_TEMP), size=3, color="#003554") + # THIS FOR THE AVERAGE
#geom_line() + #this cover the points! better before
scale_x_continuous(breaks = seq(2011,2020,1), limits = c(2011,2020)) +
xlab(element_blank()) + #https://ggplot2.tidyverse.org/reference/theme.html
ylab("Atmospheric Temperature (°C)") +
theme_classic() +
theme(panel.grid.major = element_line(colour = "#B3B0AE"))
### mean(ATMP) Part 3

Adding geom_segment() as an example of an additional geometry. The list of ggplot geometries: https://ggplot2.tidyverse.org/reference/
tmp_df %>%
filter(TIME_LOCAL_SOLAR >= lubridate::ymd("2011-01-01")) %>%
group_by(YEAR) %>%
summarise(
AVG_TEMP = mean(ATMP, na.rm=TRUE),
SD_TEMP = sd(ATMP, na.rm=TRUE),
SD_TEMP_PERC = SD_TEMP/AVG_TEMP*100,
T_MAX = max(ATMP, na.rm = TRUE),
T_MIN = min(ATMP, na.rm = TRUE),
nr = n()
) %>%
ggplot(data = .) +
geom_point(aes(x = YEAR, y = T_MAX), color = "#051923", bg = "#051923", pch = 24) + #ADD 1
geom_point(aes(x = YEAR, y = T_MIN), color = "#051923", bg = "#051923", pch = 25) + #ADD 1
geom_segment(aes(x = YEAR, xend = YEAR, y = T_MIN, yend = T_MAX), color = "#00A6FB") +
geom_line(aes(x = YEAR, y = (AVG_TEMP + SD_TEMP )), color = "#006494") +
geom_line(aes(x = YEAR, y = (AVG_TEMP - SD_TEMP )), color = "#006494") +
geom_line(aes(x = YEAR, y = AVG_TEMP), color="#006494") +
geom_point(aes(x = YEAR, y = AVG_TEMP), size=3, color="#003554") +
#geom_line() + #this cover the points! better before
scale_x_continuous(breaks = seq(2011,2020,1), limits = c(2011,2020)) +
xlab(element_blank()) + #https://ggplot2.tidyverse.org/reference/theme.html
ylab("Atmospheric Temperature (°C)") +
theme_classic() +
theme(panel.grid.major = element_line(colour = "#CCCCCC"))
### geom_boxplot()

tmp_df %>%
filter(TIME_LOCAL_SOLAR >= lubridate::ymd("2011-01-01")) %>%
ggplot(.) +
geom_boxplot(aes(x = YEAR, y = ATMP, group = YEAR)) +
scale_x_continuous(breaks = seq(2011,2020,1), limits = c(2010.5,2020.5)) +
xlab(element_blank()) + #https://ggplot2.tidyverse.org/reference/theme.html
ylab("Atmospheric Temperature (°C)") +
theme_classic() +
theme(panel.grid.major = element_line(colour = "#CCCCCC"))
## Warning: Removed 34964 rows containing non-finite values (stat_boxplot).
### geom_violin()

tmp_df %>%
filter(TIME_LOCAL_SOLAR >= lubridate::ymd("2011-01-01")) %>%
ggplot(.) +
geom_violin(aes(x = YEAR, y = ATMP, group = YEAR)) +
scale_x_continuous(breaks = seq(2011,2020,1), limits = c(2010.5,2020.5)) +
xlab(element_blank()) + #https://ggplot2.tidyverse.org/reference/theme.html
ylab("Atmospheric Temperature (°C)") +
theme_classic() +
theme(panel.grid.major = element_line(colour = "#CCCCCC"))
## Warning: Removed 34964 rows containing non-finite values (stat_ydensity).
### Take home message
Use geometry functions to produce better (nicer, meaningful) plots.
### MONTH

We have seen how powerful group_by() is. Let’s consider MONTH:
NOTE FOR FEDE: start only with MONTH and then move to label = TRUE
tmp_df <- temperature_df_solar %>%
filter(TIME_LOCAL_SOLAR > lubridate::ymd("2011-01-01")) %>%
mutate(YEAR = lubridate::year(TIME_LOCAL_SOLAR),
MONTH = lubridate::month(TIME_LOCAL_SOLAR),
MONTH_NAME = lubridate::month(TIME_LOCAL_SOLAR, label = TRUE, abbr = FALSE),
MONTH_ABBR = lubridate::month(TIME_LOCAL_SOLAR, label = TRUE),
) %>%
glimpse()
## Rows: 870,424
## Columns: 6
## $ TIME_LOCAL_SOLAR <dttm> 2011-01-01 00:06:00, 2011-01-01 00:12:00, 2011-01-01…
## $ ATMP <dbl> 9.2, 9.1, 9.0, 8.9, 8.8, 8.8, 8.8, 8.7, 8.8, 8.8, 8.7…
## $ YEAR <dbl> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011,…
## $ MONTH <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ MONTH_NAME <ord> January, January, January, January, January, January,…
## $ MONTH_ABBR <ord> Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan…
Here we will want to see how the Temperature distribution of a month is changing (or not) over the years.
tmp_df %>%
filter(MONTH == 1) %>% # change from 1 to 12
ggplot(.) +
geom_boxplot(aes(x = YEAR, y = ATMP, group = YEAR) ) +
scale_x_continuous(breaks = seq(2011,2020,1), limits = c(2010.5,2020.5)) +
scale_y_continuous(limits = c(0,35)) +
xlab(element_blank()) + #https://ggplot2.tidyverse.org/reference/theme.html
ylab("Atmospheric Temperature (°C)") +
theme_classic() +
theme(panel.grid.major = element_line(colour = "#CCCCCC"))
## Warning: Removed 1354 rows containing non-finite values (stat_boxplot).
A climatic analysis over 10 years: each boxplot will contain all the data for one month across every year (for example, all the Januaries together).
Note: using levels() to supply the axis labels is a good option!
tmp_df %>%
ggplot(.) +
geom_boxplot(aes(x = MONTH, y = ATMP, group = MONTH) ) +
#scale_x_continuous(breaks = seq(1,12,1), limits = c(0.5,12.5)) +
scale_x_continuous(breaks = seq(1,12,1), limits = c(0.5,12.5), labels = c(levels(tmp_df$MONTH_ABBR))) + #https://ggplot2.tidyverse.org/reference/scale_continuous.html
scale_y_continuous(limits = c(0,35)) +
xlab(element_blank()) +
ylab("Atmospheric Temperature (°C)") +
theme_classic() +
theme(panel.grid.major = element_line(colour = "#CCCCCC"))
## Warning: Removed 34964 rows containing non-finite values (stat_boxplot).
tmp_df %>%
group_by(MONTH) %>%
summarise(YEAR = YEAR,
MONTH = MONTH,
T_AVG = median(ATMP, na.rm = TRUE),
n = n()
) %>%
glimpse() %>%
ggplot(.) +
geom_point(aes(x = MONTH, y = T_AVG, group = MONTH)) +
scale_x_continuous(breaks = seq(1,12,1), limits = c(0.5,12.5), labels = c(levels(tmp_df$MONTH_ABBR))) +
scale_y_continuous(limits = c(0,35)) +
xlab(element_blank()) +
ylab("Atmospheric Temperature (°C)") +
theme_classic() +
theme(panel.grid.major = element_line(colour = "#CCCCCC"))
## `summarise()` has grouped output by 'MONTH'. You can override using the
## `.groups` argument.
## Rows: 870,424
## Columns: 4
## Groups: MONTH [12]
## $ MONTH <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ YEAR <dbl> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011…
## $ T_AVG <dbl> 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1…
## $ n <int> 73664, 73664, 73664, 73664, 73664, 73664, 73664, 73664, 73664, 7…
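Note that the summarise() above keeps YEAR and MONTH as columns, so it returns one row per original observation (that is why the same median repeats in the output, and why summarise() prints a grouping message). A sketch of the more common pattern, on a small made-up tibble (demo_df is hypothetical), collapses to one row per group:

```r
library(dplyr)

# Hypothetical mini data set: 3 months, 4 observations each, with some NAs
demo_df <- tibble(MONTH = rep(1:3, each = 4),
                  ATMP  = c(10, 11, NA, 12, 14, 15, 16, NA, 20, 21, 22, 23))

monthly <- demo_df %>%
  group_by(MONTH) %>%
  summarise(T_AVG = median(ATMP, na.rm = TRUE),
            n     = n(),
            .groups = "drop")   # one row per MONTH, no grouping message

monthly
```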
### geom_line() plots
tmp_df %>%
group_by(YEAR, MONTH) %>%
summarise(#TIME = lubridate::floor_date(x = mean(TIME_LOCAL_SOLAR, na.rm=TRUE), unit = "day"), #Not in class
TEMPERATURE = mean(ATMP, na.rm = T)
) %>%
#ungroup() %>% # Not in class
#filter(YEAR == 2011) %>% # EXAMPLE IN CLASS
glimpse() %>%
ggplot(data = ., aes(x = MONTH, y = TEMPERATURE, group = as.factor(YEAR), color=as.factor(YEAR))) +
geom_line() +
scale_x_continuous(breaks = seq(1,12,1), limits = c(0.5,12.5), labels = c(levels(tmp_df$MONTH_ABBR))) +
scale_y_continuous(limits = c(0,35)) +
xlab(element_blank()) +
ylab("Atmospheric Temperature (°C)") +
theme_classic() +
theme(panel.grid.major = element_line(colour = "#CCCCCC"))
## `summarise()` has grouped output by 'YEAR'. You can override using the `.groups`
## argument.
## Rows: 120
## Columns: 3
## Groups: YEAR [10]
## $ YEAR <dbl> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011…
## $ MONTH <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7…
## $ TEMPERATURE <dbl> 9.779760, 9.923522, 11.237834, 11.406021, 11.705405, 12.74…
### SEASON & case_when()
Let's create a new grouping… just to play! And we can learn a new function!
tmp_df <- tmp_df %>%
mutate(SEASON = case_when( MONTH %in% c(1:2,12) ~ "Winter",
MONTH %in% c(3:5) ~ "Spring",
MONTH %in% c(6:8) ~ "Summer",
MONTH %in% c(9:11) ~ "Fall"
))
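In isolation, case_when() evaluates its conditions in order and the first match wins; a minimal sketch on a plain vector:

```r
library(dplyr)

m <- c(1, 4, 7, 10, 12)
case_when(m %in% c(1:2, 12) ~ "Winter",
          m %in% 3:5        ~ "Spring",
          m %in% 6:8        ~ "Summer",
          m %in% 9:11       ~ "Fall")
# "Winter" "Spring" "Summer" "Fall" "Winter"
```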
… also, do you remember the %in% operator?
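As a base-R refresher, %in% tests membership element-wise and returns a logical vector:

```r
months <- c(1, 4, 7, 12)
months %in% c(1:2, 12)   # TRUE FALSE FALSE TRUE
```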
tmp_df %>%
filter(YEAR == 2018) %>%
ggplot() +
geom_boxplot(aes(x = SEASON, y = ATMP, group = SEASON))
## Warning: Removed 3932 rows containing non-finite values (stat_boxplot).
There is something wrong here…. we need factors!
### factor()
Remember that factors are data structures in R that store categorical data. Datasets often contain fields that take only a few predefined values, for example: gender, availability, country, marital status, etc.
tmp_df <- tmp_df %>%
mutate(SEASON = factor(SEASON, levels = c("Winter",
"Spring",
"Summer",
"Fall")
)
)
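To see why the levels argument matters, compare the default (alphabetical) level order with an explicit one; ggplot, table(), and friends all follow the level order (a standalone base-R sketch):

```r
s <- c("Summer", "Winter", "Spring", "Fall")

levels(factor(s))
# "Fall" "Spring" "Summer" "Winter"   (alphabetical -- not what we want)

levels(factor(s, levels = c("Winter", "Spring", "Summer", "Fall")))
# "Winter" "Spring" "Summer" "Fall"   (seasonal order)
```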
And plot again:
tmp_df %>%
filter(YEAR == 2018) %>%
ggplot() +
geom_boxplot(aes(x = SEASON, y = ATMP, group = SEASON))
## Warning: Removed 3932 rows containing non-finite values (stat_boxplot).
Wait a minute… Fall temperature is higher than Summer?
tmp_df %>%
filter(YEAR == 2018) %>%
group_by(MONTH) %>%
summarise(T_AVG = mean(ATMP, na.rm = TRUE)
) %>%
#view()
knitr::kable()
| MONTH | T_AVG |
|---|---|
| 1 | 11.47157 |
| 2 | 11.33772 |
| 3 | 11.19353 |
| 4 | 12.14815 |
| 5 | 12.47028 |
| 6 | 13.24482 |
| 7 | 14.02146 |
| 8 | 13.89349 |
| 9 | 14.02653 |
| 10 | 14.91397 |
| 11 | 13.76763 |
| 12 | 11.77784 |
The Indian Summer is real! https://en.wikipedia.org/wiki/Indian_summer
### DAY
We are reaching the top! Create a DAY variable:
tmp_df <- temperature_df_solar %>%
filter(TIME_LOCAL_SOLAR > lubridate::ymd("2011-01-01")) %>%
mutate(YEAR = lubridate::year(TIME_LOCAL_SOLAR),
MONTH = lubridate::month(TIME_LOCAL_SOLAR),
MONTH_NAME = lubridate::month(TIME_LOCAL_SOLAR, label = TRUE, abbr = FALSE),
MONTH_ABBR = lubridate::month(TIME_LOCAL_SOLAR, label = TRUE, abbr = TRUE),
DAY = lubridate::day(TIME_LOCAL_SOLAR),
DAY_NAME = lubridate::wday(TIME_LOCAL_SOLAR, label = TRUE, abbr = FALSE),
DAY_ABBR = lubridate::wday(TIME_LOCAL_SOLAR, label = TRUE, abbr = TRUE),
WEEKDAY = lubridate::wday(TIME_LOCAL_SOLAR),
SEASON = case_when( MONTH %in% c(1:2,12) ~ "Winter",
MONTH %in% c(3:5) ~ "Spring",
MONTH %in% c(6:8) ~ "Summer",
MONTH %in% c(9:11) ~ "Fall"
)
) %>%
mutate(SEASON = factor(SEASON, levels = c("Winter",
"Spring",
"Summer",
"Fall")
)
) %>%
glimpse()
## Rows: 870,424
## Columns: 11
## $ TIME_LOCAL_SOLAR <dttm> 2011-01-01 00:06:00, 2011-01-01 00:12:00, 2011-01-01…
## $ ATMP <dbl> 9.2, 9.1, 9.0, 8.9, 8.8, 8.8, 8.8, 8.7, 8.8, 8.8, 8.7…
## $ YEAR <dbl> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011,…
## $ MONTH <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ MONTH_NAME <ord> January, January, January, January, January, January,…
## $ MONTH_ABBR <ord> Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan…
## $ DAY <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ DAY_NAME <ord> Saturday, Saturday, Saturday, Saturday, Saturday, Sat…
## $ DAY_ABBR <ord> Sat, Sat, Sat, Sat, Sat, Sat, Sat, Sat, Sat, Sat, Sat…
## $ WEEKDAY <dbl> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,…
## $ SEASON <fct> Winter, Winter, Winter, Winter, Winter, Winter, Winte…
Let’s compare January’s days over the 10 years! (.. if it makes sense..)
tmp_df %>%
filter(MONTH == 1) %>%
group_by(YEAR,DAY) %>%
summarise(TIME = lubridate::floor_date(x = mean(TIME_LOCAL_SOLAR, na.rm=TRUE), unit = "day"),
TEMPERATURE = mean(ATMP, na.rm = T)
) %>%
ungroup() %>%
#filter(YEAR == 2011) %>%
ggplot(data = ., aes(x = DAY, y = TEMPERATURE, group = as.factor(YEAR), color=as.factor(YEAR))) +
geom_line() +
scale_x_continuous(breaks = seq(1,31,1), limits = c(0.5,31.5)) +
scale_y_continuous(limits = c(0,35)) +
xlab(element_blank()) +
ylab("Atmospheric Temperature (°C)") +
theme_classic() +
theme(panel.grid.major = element_line(colour = "#CCCCCC"))
## `summarise()` has grouped output by 'YEAR'. You can override using the `.groups`
## argument.
### HOUR
We have finally reached the queen! Let's make an hourly average :D … it will take some time.
tmp_df <- temperature_df_solar %>%
filter(TIME_LOCAL_SOLAR > lubridate::ymd("2011-01-01")) %>%
mutate(YEAR = lubridate::year(TIME_LOCAL_SOLAR),
MONTH = lubridate::month(TIME_LOCAL_SOLAR),
MONTH_NAME = lubridate::month(TIME_LOCAL_SOLAR, label = TRUE, abbr = FALSE),
MONTH_ABBR = lubridate::month(TIME_LOCAL_SOLAR, label = TRUE, abbr = TRUE),
DAY = lubridate::day(TIME_LOCAL_SOLAR),
DAY_NAME = lubridate::wday(TIME_LOCAL_SOLAR, label = TRUE, abbr = FALSE),
DAY_ABBR = lubridate::wday(TIME_LOCAL_SOLAR, label = TRUE, abbr = TRUE),
WEEKDAY = lubridate::wday(TIME_LOCAL_SOLAR),
HOUR = lubridate::hour(TIME_LOCAL_SOLAR) # here!
) %>%
glimpse()
## Rows: 870,424
## Columns: 11
## $ TIME_LOCAL_SOLAR <dttm> 2011-01-01 00:06:00, 2011-01-01 00:12:00, 2011-01-01…
## $ ATMP <dbl> 9.2, 9.1, 9.0, 8.9, 8.8, 8.8, 8.8, 8.7, 8.8, 8.8, 8.7…
## $ YEAR <dbl> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011,…
## $ MONTH <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ MONTH_NAME <ord> January, January, January, January, January, January,…
## $ MONTH_ABBR <ord> Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan…
## $ DAY <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ DAY_NAME <ord> Saturday, Saturday, Saturday, Saturday, Saturday, Sat…
## $ DAY_ABBR <ord> Sat, Sat, Sat, Sat, Sat, Sat, Sat, Sat, Sat, Sat, Sat…
## $ WEEKDAY <dbl> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,…
## $ HOUR <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
tmp_df_1H <- temperature_df_solar %>%
filter(TIME_LOCAL_SOLAR > lubridate::ymd("2011-01-01")) %>%
mutate(YEAR = lubridate::year(TIME_LOCAL_SOLAR),
MONTH = lubridate::month(TIME_LOCAL_SOLAR),
MONTH_NAME = lubridate::month(TIME_LOCAL_SOLAR, label = TRUE, abbr = FALSE),
MONTH_ABBR = lubridate::month(TIME_LOCAL_SOLAR, label = TRUE, abbr = TRUE),
DAY = lubridate::day(TIME_LOCAL_SOLAR),
DAY_NAME = lubridate::wday(TIME_LOCAL_SOLAR, label = TRUE, abbr = FALSE),
DAY_ABBR = lubridate::wday(TIME_LOCAL_SOLAR, label = TRUE, abbr = TRUE),
WEEKDAY = lubridate::wday(TIME_LOCAL_SOLAR),
HOUR = lubridate::hour(TIME_LOCAL_SOLAR)
) %>%
group_by(YEAR, MONTH, DAY, HOUR) %>%
summarise(TIME_HR = lubridate::ceiling_date(x = mean(TIME_LOCAL_SOLAR, na.rm=TRUE), unit = "hour"),
ATMP_AVG = mean(ATMP, na.rm = TRUE),
NR_OBS = n()
) %>%
ungroup() %>%
print(n=10)
## `summarise()` has grouped output by 'YEAR', 'MONTH', 'DAY'. You can override
## using the `.groups` argument.
## # A tibble: 87,367 × 7
## YEAR MONTH DAY HOUR TIME_HR ATMP_AVG NR_OBS
## <dbl> <dbl> <int> <int> <dttm> <dbl> <int>
## 1 2011 1 1 0 2011-01-01 01:00:00 8.9 9
## 2 2011 1 1 1 2011-01-01 02:00:00 8.66 10
## 3 2011 1 1 2 2011-01-01 03:00:00 8.44 10
## 4 2011 1 1 3 2011-01-01 04:00:00 8.51 10
## 5 2011 1 1 4 2011-01-01 05:00:00 7.75 10
## 6 2011 1 1 5 2011-01-01 06:00:00 6.98 10
## 7 2011 1 1 6 2011-01-01 07:00:00 6.72 10
## 8 2011 1 1 7 2011-01-01 08:00:00 6.39 10
## 9 2011 1 1 8 2011-01-01 09:00:00 6.39 10
## 10 2011 1 1 9 2011-01-01 10:00:00 6.62 10
## # … with 87,357 more rows
Maybe you can't yet appreciate how beautiful and "simple" this is…
tmp_df_1H %>%
filter(TIME_HR >= "2018-01-01",
TIME_HR <= "2018-01-31") %>%
ggplot(data = ., mapping = aes(x = TIME_HR, y = ATMP_AVG)) +
geom_line() +
geom_point()
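An alternative to grouping by YEAR/MONTH/DAY/HOUR is to bin the timestamps directly with lubridate::floor_date() (the chunk above uses ceiling_date() on the mean timestamp; floor_date() labels each hour by its start instead). A sketch on made-up 6-minute data (demo is hypothetical, not the station file):

```r
library(dplyr)
library(lubridate)

# Two hours of hypothetical 6-minute observations
demo <- tibble(
  TIME = ymd_hms("2011-01-01 00:00:00") + minutes(seq(0, 114, by = 6)),
  ATMP = 1:20
)

hourly <- demo %>%
  group_by(TIME_HR = floor_date(TIME, unit = "hour")) %>%
  summarise(ATMP_AVG = mean(ATMP), NR_OBS = n(), .groups = "drop")

hourly   # 2 rows: 10 observations averaged into each hour
```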
### MONTH
library(patchwork)
tmp_df %>%
group_by(YEAR, MONTH_ABBR) %>%
summarise(TEMPERATURE = median(ATMP, na.rm = TRUE)) %>%
filter(MONTH_ABBR == "Jan") %>%
ungroup() %>%
glimpse() %>%
ggplot(data = ., aes(x = YEAR, y = TEMPERATURE)) +
geom_smooth(method = "lm") +
geom_point() +
ggpmisc::stat_poly_eq(formula = y ~ x,
parse = TRUE, label.x = "left", label.y = "top",
aes(label = paste(..eq.label.., ..rr.label.., sep = "~~~"))) +
plot_annotation(title = paste0("Temperature trend ", "January" ),
#subtitle = "Number of thermostat events in California homes during a wildfire episode; days with more events have darker shading",
theme = theme(plot.title = element_text(size = 14, colour = "grey20", face = "bold", hjust = 0),
plot.subtitle = element_text(size = 10, colour = "grey20", face = "italic", hjust = 0, margin = margin(b = 10)),
plot.background = element_rect(fill = "white"))) +
ylab("Temperature (°C)") +
xlab(element_blank())
## `summarise()` has grouped output by 'YEAR'. You can override using the `.groups`
## argument.
## Rows: 10
## Columns: 3
## $ YEAR <dbl> 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020
## $ MONTH_ABBR <ord> Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan
## $ TEMPERATURE <dbl> 10.10, 10.50, 9.70, 11.80, 11.90, 11.50, 9.70, 11.55, 12.0…
## `geom_smooth()` using formula 'y ~ x'
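The line that geom_smooth(method = "lm") draws, and the equation that ggpmisc::stat_poly_eq() prints, can also be obtained directly with lm(); a sketch with invented yearly medians (not the station's actual values):

```r
year <- 2011:2020
temp <- c(10.1, 10.5, 9.7, 11.8, 11.9, 11.5, 9.7, 11.6, 12.0, 11.7)  # made up

fit <- lm(temp ~ year)
coef(fit)[["year"]]      # slope: average change in degrees C per year
summary(fit)$r.squared   # the R^2 shown by stat_poly_eq()
```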
Some plots for the wind direction. First, let's create the data set.
wind_df <- meteo_df %>%
select(TIME, WDIR, WSPD) %>%
glimpse() %>%
mutate(TIME = lubridate::ymd_hms(TIME, tz = "UTC") ) %>%
glimpse()
## Rows: 870,504
## Columns: 3
## $ TIME <chr> "2011-01-01 00:00:00", "2011-01-01 00:06:00", "2011-01-01 00:12:0…
## $ WDIR <dbl> 3, 3, 1, 8, 8, 12, 11, 19, 8, 13, 7, 358, 358, 14, 7, 8, 12, 12, …
## $ WSPD <dbl> 4.3, 4.3, 4.6, 4.4, 4.4, 3.9, 3.2, 3.4, 3.1, 3.5, 3.9, 4.1, 4.1, …
## Rows: 870,504
## Columns: 3
## $ TIME <dttm> 2011-01-01 00:00:00, 2011-01-01 00:06:00, 2011-01-01 00:12:00, 2…
## $ WDIR <dbl> 3, 3, 1, 8, 8, 12, 11, 19, 8, 13, 7, 358, 358, 14, 7, 8, 12, 12, …
## $ WSPD <dbl> 4.3, 4.3, 4.6, 4.4, 4.4, 3.9, 3.2, 3.4, 3.1, 3.5, 3.9, 4.1, 4.1, …
wind_df %>%
filter(TIME >= "2020-01-01",
TIME < "2020-01-02") %>%
mutate(YEAR = lubridate::year(TIME),
MONTH = lubridate::month(TIME),
MONTH_NAME = lubridate::month(TIME, label = TRUE, abbr = FALSE),
MONTH_ABBR = lubridate::month(TIME, label = TRUE, abbr = TRUE),
DAY = lubridate::day(TIME),
DAY_NAME = lubridate::wday(TIME, label = TRUE, abbr = FALSE),
DAY_ABBR = lubridate::wday(TIME, label = TRUE, abbr = TRUE),
WEEKDAY = lubridate::wday(TIME),
HOUR = lubridate::hour(TIME)
) %>%
glimpse() %>%
group_by(HOUR) %>%
summarise(TIME = round_date(mean(TIME, na.rm=TRUE), unit = "hour"),
WSPD = mean(WSPD, na.rm = TRUE),
WDIR = (360+(atan2(sum(sin(WDIR*pi/180)),sum(cos(WDIR*pi/180)))*180/pi))%%360
) %>%
ungroup() %>%
mutate(WIND_SECTOR = case_when( WDIR > 0 & WDIR <= 22.5 ~ "North",
WDIR > 22.5 & WDIR <= 67.5 ~ "North-East",
WDIR > 67.5 & WDIR <= 112.5 ~ "East",
WDIR > 112.5 & WDIR <= 157.5 ~ "South-East",
WDIR > 157.5 & WDIR <= 202.5 ~ "South",
WDIR > 202.5 & WDIR <= 247.5 ~ "South-West",
WDIR > 247.5 & WDIR <= 292.5 ~ "West",
WDIR > 292.5 & WDIR <= 337.5 ~ "North-West",
WDIR > 337.5 & WDIR <= 360 ~ "North")
) %>%
glimpse() %>%
ggplot(., mapping = aes(x = TIME, y = WSPD)) +
geom_segment(aes(x = HOUR,
#geom_segment(aes(x = lubridate::hour(TIME),
#geom_segment(aes(x = TIME,
y = 0,
# xend = HOUR - lubridate::dhours(1 * -cos((90-WDIR) / 360 * 2 * pi)), # NORMALIZE TO 1
# yend = -1 * (1 * -sin((90-WDIR) / 360 * 2 * pi))#, col = factor(mean_WS_ms) # NORMALIZE TO 1
xend = HOUR - lubridate::dhours(WSPD * -cos((90-WDIR) / 360 * 2 * pi)), # ADDING LENGTH AS SPEED
yend = - 1 * (WSPD * -sin((90-WDIR) / 360 * 2 * pi)) # ADDING LENGTH AS SPEED
),
arrow = arrow(length = unit(0.2, "cm"), type = "closed"),
colour = RColorBrewer::brewer.pal(name = "BuPu", n = 9)[7]) +
geom_point(aes(HOUR, 0), size = 1, colour = RColorBrewer::brewer.pal(name = "BuPu", n = 9)[7]) +
coord_fixed(3600) +
theme(legend.position = "none") +
ylab("WD") +
theme_minimal(base_size = 20) +
theme(axis.text.x = element_text(angle = 20), axis.title.x = element_blank(),
axis.text.y = element_text(colour = "white"),
panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank(),
plot.background = element_rect(fill = "transparent", colour = NA))
## Rows: 240
## Columns: 12
## $ TIME <dttm> 2020-01-01 08:00:00, 2020-01-01 08:06:00, 2020-01-01 08:12…
## $ WDIR <dbl> 269, 315, 246, 172, 103, 105, 267, 41, 74, 73, 86, 118, 200…
## $ WSPD <dbl> 2.5, 1.1, 1.9, 0.9, 1.2, 1.5, 1.3, 2.4, 0.8, 0.7, 0.7, 0.9,…
## $ YEAR <dbl> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020,…
## $ MONTH <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ MONTH_NAME <ord> January, January, January, January, January, January, Janua…
## $ MONTH_ABBR <ord> Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan,…
## $ DAY <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ DAY_NAME <ord> Wednesday, Wednesday, Wednesday, Wednesday, Wednesday, Wedn…
## $ DAY_ABBR <ord> Wed, Wed, Wed, Wed, Wed, Wed, Wed, Wed, Wed, Wed, Wed, Wed,…
## $ WEEKDAY <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ HOUR <int> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,…
## Rows: 24
## Columns: 5
## $ HOUR <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
## $ TIME <dttm> 2020-01-02 00:00:00, 2020-01-02 01:00:00, 2020-01-02 02:0…
## $ WSPD <dbl> 3.97, 3.33, 2.67, 2.67, 4.46, 3.78, 3.26, 1.96, 1.43, 1.31…
## $ WDIR <dbl> 238.48146, 242.40312, 237.22234, 242.44388, 245.29992, 244…
## $ WIND_SECTOR <chr> "South-West", "South-West", "South-West", "South-West", "S…
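The WDIR line in the summarise() above is a vector (circular) mean: each direction is treated as a unit vector, the components are summed, and atan2() converts the result back to degrees. This matters because an arithmetic mean of degrees fails badly near north. A base-R sketch:

```r
wdir <- c(350, 10)   # two near-north directions

mean(wdir)           # 180 -- the arithmetic mean points due south, clearly wrong

circ <- (360 + atan2(sum(sin(wdir * pi / 180)),
                     sum(cos(wdir * pi / 180))) * 180 / pi) %% 360
circ                 # ~0 (or ~360): due north, as expected
```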
library(clifro)
library(openair)
## Warning: package 'openair' was built under R version 4.0.5
Prepare the dataframe:
wind_df_WR <- wind_df %>%
filter(TIME >= "2020-01-01",
TIME < "2020-01-02") %>%
mutate(YEAR = lubridate::year(TIME),
MONTH = lubridate::month(TIME),
MONTH_NAME = lubridate::month(TIME, label = TRUE, abbr = FALSE),
MONTH_ABBR = lubridate::month(TIME, label = TRUE, abbr = TRUE),
DAY = lubridate::day(TIME),
DAY_NAME = lubridate::wday(TIME, label = TRUE, abbr = FALSE),
DAY_ABBR = lubridate::wday(TIME, label = TRUE, abbr = TRUE),
WEEKDAY = lubridate::wday(TIME),
HOUR = lubridate::hour(TIME)
) %>%
glimpse() %>%
group_by(HOUR) %>%
summarise(TIME = round_date(mean(TIME, na.rm=TRUE), unit = "hour"),
WSPD = mean(WSPD, na.rm = TRUE),
WDIR = (360+(atan2(sum(sin(WDIR*pi/180)),sum(cos(WDIR*pi/180)))*180/pi))%%360
) %>%
ungroup() %>%
mutate(WIND_SECTOR = case_when( WDIR > 0 & WDIR <= 22.5 ~ "North",
WDIR > 22.5 & WDIR <= 67.5 ~ "North-East",
WDIR > 67.5 & WDIR <= 112.5 ~ "East",
WDIR > 112.5 & WDIR <= 157.5 ~ "South-East",
WDIR > 157.5 & WDIR <= 202.5 ~ "South",
WDIR > 202.5 & WDIR <= 247.5 ~ "South-West",
WDIR > 247.5 & WDIR <= 292.5 ~ "West",
WDIR > 292.5 & WDIR <= 337.5 ~ "North-West",
WDIR > 337.5 & WDIR <= 360 ~ "North")
) %>%
glimpse()
## Rows: 240
## Columns: 12
## $ TIME <dttm> 2020-01-01 08:00:00, 2020-01-01 08:06:00, 2020-01-01 08:12…
## $ WDIR <dbl> 269, 315, 246, 172, 103, 105, 267, 41, 74, 73, 86, 118, 200…
## $ WSPD <dbl> 2.5, 1.1, 1.9, 0.9, 1.2, 1.5, 1.3, 2.4, 0.8, 0.7, 0.7, 0.9,…
## $ YEAR <dbl> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020,…
## $ MONTH <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ MONTH_NAME <ord> January, January, January, January, January, January, Janua…
## $ MONTH_ABBR <ord> Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan,…
## $ DAY <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ DAY_NAME <ord> Wednesday, Wednesday, Wednesday, Wednesday, Wednesday, Wedn…
## $ DAY_ABBR <ord> Wed, Wed, Wed, Wed, Wed, Wed, Wed, Wed, Wed, Wed, Wed, Wed,…
## $ WEEKDAY <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ HOUR <int> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,…
## Rows: 24
## Columns: 5
## $ HOUR <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
## $ TIME <dttm> 2020-01-02 00:00:00, 2020-01-02 01:00:00, 2020-01-02 02:0…
## $ WSPD <dbl> 3.97, 3.33, 2.67, 2.67, 4.46, 3.78, 3.26, 1.96, 1.43, 1.31…
## $ WDIR <dbl> 238.48146, 242.40312, 237.22234, 242.44388, 245.29992, 244…
## $ WIND_SECTOR <chr> "South-West", "South-West", "South-West", "South-West", "S…
Plot:
with(wind_df_WR, windrose(speed = WSPD,
direction = WDIR,
speed_cuts = c(2,4,6,8),
ggtheme = "bw"
)
)
library(patchwork) # at the end for extra time
p1 <- tmp_df %>%
filter(TIME_LOCAL_SOLAR >= lubridate::ymd("2011-01-01")) %>%
group_by(YEAR) %>%
summarise(
AVG_TEMP = mean(ATMP, na.rm=TRUE),
SD_TEMP = sd(ATMP, na.rm=TRUE),
SD_TEMP_PERC = SD_TEMP/AVG_TEMP*100,
T_MAX = max(ATMP, na.rm = TRUE),
T_MIN = min(ATMP, na.rm = TRUE),
nr = n()
) %>%
ggplot(data = .) +
geom_point(aes(x = YEAR, y = T_MAX), color = "#051923", bg = "#051923", pch = 24) + #ADD 1
geom_point(aes(x = YEAR, y = T_MIN), color = "#051923", bg = "#051923", pch = 25) + #ADD 1
geom_segment(aes(x = YEAR, xend = YEAR, y = T_MIN, yend = T_MAX), color = "#00A6FB") +
geom_line(aes(x = YEAR, y = (AVG_TEMP + SD_TEMP )), color = "#006494") +
geom_line(aes(x = YEAR, y = (AVG_TEMP - SD_TEMP )), color = "#006494") +
geom_line(aes(x = YEAR, y = AVG_TEMP), color="#006494") +
geom_point(aes(x = YEAR, y = AVG_TEMP), size=3, color="#003554") +
#geom_line() + #this cover the points! better before
scale_x_continuous(breaks = seq(2011,2020,1), limits = c(2011,2020)) +
xlab(element_blank()) + #https://ggplot2.tidyverse.org/reference/theme.html
ylab("Atmospheric Temperature (°C)") +
theme_classic() +
theme(panel.grid.major = element_line(colour = "#CCCCCC"))
p2 <- tmp_df %>%
filter(TIME_LOCAL_SOLAR >= lubridate::ymd("2011-01-01")) %>%
ggplot(.) +
geom_boxplot(aes(x = YEAR, y = ATMP, group = YEAR)) +
scale_x_continuous(breaks = seq(2011,2020,1), limits = c(2010.5,2020.5)) +
xlab(element_blank()) + #https://ggplot2.tidyverse.org/reference/theme.html
ylab("Atmospheric Temperature (°C)") +
theme_classic() +
theme(panel.grid.major = element_line(colour = "#CCCCCC"))
p1 / p2
## Warning: Removed 34964 rows containing non-finite values (stat_boxplot).
#p1 + p2
#p1 | p2